In this project, I explored the relationship between sustainable development indicators reported by the World Bank and life expectancy. I perform mutiliniear regression and create a model to predict country-wide mean life expectancy using variables reported in the Environment, Social, and Governance (ESG) dataset. For me it is fascinating to explore the effect of this myriad of factors on individual well-being. Life-expectancy can be a rough proxy for well-being because many factors, such as clean water and good healthcare, to intergeneration stress, have been shown to affect lifespan.
Install/attach required packages
p<-.libPaths() #where your packages are installed
`%notin%` <- Negate(`%in%`)
pkgs<-c('dplyr', 'tidyr', 'ggplot2', 'corrplot', 'psych', 'plotly', 'car') # packages needed for this project
for (i in 1:length(pkgs)){
pkg<-pkgs[[i]]
if (pkg %notin% rownames(installed.packages())){
install.packages(pkg, p)
}
if (pkg %notin% rownames(installed.packages())){
print(paste("Error installing ", pkg, ". Check Warnings."))
}
}
colnames(wb_tidy)
[1] "Country.Name"
[2] "Access to clean fuels and technologies for cooking (% of population)"
[3] "Access to electricity (% of population)"
[4] "Adjusted savings: natural resources depletion (% of GNI)"
[5] "Adjusted savings: net forest depletion (% of GNI)"
[6] "Agricultural land (% of land area)"
[7] "Agriculture, forestry, and fishing, value added (% of GDP)"
[8] "Annual freshwater withdrawals, total (% of internal resources)"
[9] "Annualized average growth rate in per capita real survey mean consumption or income, total population (%)"
[10] "Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)"
[11] "Children in employment, total (% of children ages 7-14)"
[12] "CO2 emissions (metric tons per capita)"
[13] "Control of Corruption: Estimate"
[14] "Cooling Degree Days"
[15] "Droughts, floods, extreme temperatures (% of population, average 1990-2009)"
[16] "Ease of doing business index (1=most business-friendly regulations)"
[17] "Electricity production from coal sources (% of total)"
[18] "Energy imports, net (% of energy use)"
[19] "Energy intensity level of primary energy (MJ/$2011 PPP GDP)"
[20] "Energy use (kg of oil equivalent per capita)"
[21] "Fertility rate, total (births per woman)"
[22] "Food production index (2004-2006 = 100)"
[23] "Forest area (% of land area)"
[24] "Fossil fuel energy consumption (% of total)"
[25] "GDP growth (annual %)"
[26] "GHG net emissions/removals by LUCF (Mt of CO2 equivalent)"
[27] "GINI index (World Bank estimate)"
[28] "Government Effectiveness: Estimate"
[29] "Government expenditure on education, total (% of government expenditure)"
[30] "Health Index 35"
[31] "Hospital beds (per 1,000 people)"
[32] "Income share held by lowest 20%"
[33] "Individuals using the Internet (% of population)"
[34] "Labor force participation rate, total (% of total population ages 15-64) (modeled ILO estimate)"
[35] "Life expectancy at birth, total (years)"
[36] "Literacy rate, adult total (% of people ages 15 and above)"
[37] "Mammal species, threatened"
[38] "Maximum 5-day Rainfall (25-yr RL)"
[39] "Mean Drought Index (SPEI)"
[40] "Methane emissions (metric tons of CO2 equivalent per capita)"
[41] "Mortality rate, under-5 (per 1,000 live births)"
[42] "Net migration"
[43] "Nitrous oxide emissions (metric tons of CO2 equivalent per capita)"
[44] "Patent applications, residents"
[45] "People using safely managed drinking water services (% of population)"
[46] "People using safely managed sanitation services (% of population)"
[47] "PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)"
[48] "Political Stability and Absence of Violence/Terrorism: Estimate"
[49] "Population ages 65 and above (% of total population)"
[50] "Population density (people per sq. km of land area)"
[51] "Poverty headcount ratio at national poverty lines (% of population)"
[52] "Prevalence of overweight (% of adults)"
[53] "Prevalence of undernourishment (% of population)"
[54] "Proportion of seats held by women in national parliaments (%)"
[55] "Ratio of female to male labor force participation rate (%) (modeled ILO estimate)"
[56] "Regulatory Quality: Estimate"
[57] "Renewable electricity output (% of total electricity output)"
[58] "Renewable energy consumption (% of total final energy consumption)"
[59] "Research and development expenditure (% of GDP)"
[60] "Rule of Law: Estimate"
[61] "School enrollment, primary (% gross)"
[62] "School enrollment, primary and secondary (gross), gender parity index (GPI)"
[63] "Scientific and technical journal articles"
[64] "Strength of legal rights index (0=weak to 12=strong)"
[65] "Terrestrial and marine protected areas (% of total territorial area)"
[66] "Unemployment, total (% of total labor force) (modeled ILO estimate)"
[67] "Unmet need for contraception (% of married women ages 15-49)"
[68] "Voice and Accountability: Estimate"
library(ggplot2)
Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/
Attaching package: ‘ggplot2’
The following objects are masked from ‘package:psych’:
%+%, alpha
wb_select<-read.csv("../data/wb_tidy.csv", row.names = 1)
library(ggplot2)
check_skew<-gather(wb_select, Measure, Value, -Country) # reshape the dataframe for plotting
ggplot(check_skew, aes(Value))+ # make histograms
geom_histogram(bins = 200)+
facet_wrap(~Measure,strip.position = "bottom")+
coord_cartesian(ylim = c(0,50))
rm(check_skew) #remove this dataframe so it doesnt clutter our environment
Most of these variables are right skewed, so they could be transformed with a log transformation # Below, some columns are log-transformed
wb_log<-wb_select%>%
mutate_at(vars(Renewable.electricity.output,
Renewable.energy.consumption,
Terrestrial.and.marine.protected.areas,
CO2.emissions,
PM2.5.air.pollution,
Unemployment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
natural.resources.depletion,
Prevalence.of.undernourishment),
~log1p(1+.))%>% # log after adding 1 to make all data positive
mutate_at(vars(Access.to.electricity,
Access.to.clean.fuels.and.technologies.for.cooking),
~log1p(101-.)) # left skewed variable transformed after subtracted from 110 to turn it into a left skew
wb_scaled<-wb_log%>%
mutate_at(c('Renewable.electricity.output',
'Access.to.electricity',
'Renewable.energy.consumption',
'Terrestrial.and.marine.protected.areas',
'Agricultural.land',
'Individuals.using.the.Internet',
'CO2.emissions',
'GDP.growth',
'Political.Stability.and.Absence.of.Terrorism',
'Rule.of.Law',
'Government.Effectiveness',
'PM2.5.air.pollution',
'Proportion.of.seats.held.by.women.in.parliaments',
'Access.to.clean.fuels.and.technologies.for.cooking',
'Labor.force.participation.rate',
'Unemployment',
'School.enrollment',
'Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions',
'natural.resources.depletion',
'Prevalence.of.undernourishment',
'Life.expectancy.at.birth'), ~(.-min(.))/(max(.)-min(.)))
check_scaling<-wb_scaled%>%
select(-c(Access.to.clean.fuels.and.technologies.for.cooking, Access.to.electricity, Prevalence.of.undernourishment))%>%
gather(Measure, Value, -Country) # re-shape for plotting
check_scaling$Measure<-as.factor(check_scaling$Measure)
ggplot(check_scaling, aes(Value))+ # make histograms
geom_histogram(bins = 100)+
facet_wrap(~Measure,strip.position = "bottom")
rm(check_scaling) #remove this dataframe so it doesnt clutter our environment
Now the data looks way more normally distributed
wb_transformed<-read.csv("../data/wb_transformed.csv",row.names = 1)
library(psych)
pairs.panels(wb_transformed[,2:10],
method = "pearson",
hist.col = "#00AFBB")
wb_transformed<-read.csv("../data/wb_transformed.csv",row.names = 1)
corrplot(cor(wb_transformed[2:10]),order="hclust",tl.col="black",tl.cex=.62)
# re-shape the data for plotting
wb_long<-gather(wb_transformed, Measure, Value, -Country, -Life.expectancy.at.birth)
wb_long$Measure<-as.factor(wb_long$Measure)
ggplot(wb_long, aes(x=Value,y=Life.expectancy.at.birth))+
geom_point(size=1,color='forestgreen')+
facet_wrap(~Measure,strip.position = "bottom")
Some of these variable clearly have a nonlinear relationship with life expectancy. Specifically, for CO2.emissions, Rule.of.law, Indiviuals.using.the.internet, and Government.effectiveness, their relationship with life expectancy follows a square root function.
Also, for Access.to.electricity, Renewable.energy.consumption, Access.to.clean.fuels.and.technologies.for.cooking, and Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions, they follow a negative square root relationship. These relationships can be added into the linear model.
I start by constructing a full model with all the variables
fullmodel<-lm(Life.expectancy.at.birth ~
Renewable.electricity.output +
I(sqrt(-(Access.to.electricity)+1)) + # square root of the negative plus 1 (move to positive side of x axis)
I(sqrt(-(Renewable.energy.consumption)+1)) +
Terrestrial.and.marine.protected.areas +
Agricultural.land +
I(sqrt(Individuals.using.the.Internet)) +
I(sqrt(CO2.emissions)) +
GDP.growth +
Political.Stability.and.Absence.of.Terrorism +
I(sqrt(Rule.of.Law)) +
I(sqrt(Government.Effectiveness)) +
PM2.5.air.pollution +
Proportion.of.seats.held.by.women.in.parliaments +
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking)+1)) +
Labor.force.participation.rate +
Unemployment +
School.enrollment +
I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions)+1)) +
natural.resources.depletion +
Prevalence.of.undernourishment, data=wb_transformed)
Then, we use backward elimination to select the variables that are predictive of life expectancy. Backward elimination uses the model selection strategy of minimizing AIC. It measures how much information about the data is lost in a model and thus assesses the trade off between model complexity and generalizability (overfitting v underfitting). During each iteration, backward elimination drops the variable that gives the highest AIC until there is no significant drop in AIC. The model that yields the lowest AIC is retained in ‘model’.
Here, we inspect the model closer.
summary(model)
Call:
lm(formula = Life.expectancy.at.birth ~ Renewable.electricity.output +
I(sqrt(-(Access.to.electricity) + 1)) + Agricultural.land +
I(sqrt(CO2.emissions)) + I(sqrt(Government.Effectiveness)) +
PM2.5.air.pollution + Proportion.of.seats.held.by.women.in.parliaments +
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking) +
1)) + Unemployment + I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions) +
1)) + Prevalence.of.undernourishment, data = wb_transformed)
Residuals:
Min 1Q Median 3Q Max
-0.199041 -0.046353 0.006253 0.058476 0.253742
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04633 0.07521 0.616 0.53896
Renewable.electricity.output -0.07104 0.02742 -2.591 0.01070 *
I(sqrt(-(Access.to.electricity) + 1)) 0.19896 0.08549 2.327 0.02154 *
Agricultural.land -0.07052 0.02996 -2.354 0.02012 *
I(sqrt(CO2.emissions)) -0.24299 0.07327 -3.316 0.00119 **
I(sqrt(Government.Effectiveness)) 0.42353 0.07209 5.875 3.54e-08 ***
PM2.5.air.pollution -0.07073 0.04340 -1.630 0.10569
Proportion.of.seats.held.by.women.in.parliaments 0.10666 0.04406 2.421 0.01692 *
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking) + 1)) 0.18494 0.06070 3.047 0.00282 **
Unemployment -0.10846 0.03689 -2.940 0.00390 **
I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions) + 1)) 0.45359 0.07595 5.972 2.23e-08 ***
Prevalence.of.undernourishment 0.08775 0.05196 1.689 0.09376 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08339 on 126 degrees of freedom
Multiple R-squared: 0.897, Adjusted R-squared: 0.888
F-statistic: 99.71 on 11 and 126 DF, p-value: < 2.2e-16
This output shows is that the model of life expectancy as a function of the above variables was significant (F(11,126) = 99.71, p < 0.01). In addition, R-squared tells us that 89% of variance in life expectancy is accounted for by the variables in this model. This is quite high considering the messy/complicated system of country-wide social and economic factors. This seems to be a good model. FYI I ran a similar model without square root relationship accounted for, and including the square roots improved the model (F(10,127) = 79.1, p < 0.01, R-squared for 0.86).
However, this isn’t the end of the story. We need to make sure that the assumptions of linear models are satisfied in order to be sure that the model is meaningful.
One assumption to inspect is absence of collinearity. If predictors are highly correlated (collinearity) the statistical significance of the model is undermined. We can look at this by looking at the variance inflation factor (VIF) of each predictor. VIF measures how easily the predictor is predicted from a linear regression using the other predictors. Any predictors with a VIF above 4 should be removed from the model.
library(car)
vifs<-dplyr::bind_cols(as.data.frame(names(vif(model))), as.data.frame(unname(vif(model))));vifs
Since several of these variables have a large VIF, some should be removed. While there is no rule in terms of what is a good VIF threshold, the following code removes the variable with the largest VIF and re-builds the model until no variables have VIF above 10.
This new model satisfies the assumption of no collinearity.
Another assumption of that of homoskedasticity, which means that te residuals of the model have a similar amount of deviation from predicted values. To assess this assumption, we can look at a residual plot. This plot should look like a blob; if it doesn’t, for instance if higher values have higher residuals, the variance would not be homoskedastic.
resid <- bind_cols(as.data.frame(fitted(finalmodel)), as.data.frame(residuals(finalmodel)))
p1 <- ggplot(resid,aes(x= fitted(finalmodel), y=residuals(finalmodel))) + geom_point() + labs(title='Residual Plot', x='Fitted Values', y='Residuals');p1
# with this plot, we are look for absence of a pattern
Last, there is the assumption of the normality of residuals. The residuals should follow a normal distribution. If they don’t, this indicates that the model is not fit optimally.
Here is a summary of the new model:
summary(finalmodel)
The final model significantly predicts life expectancy (F(8,120) = 130.7, p < 0.01) and accounts for 89% of variance in life expectancy.
This output also tells at that there are main effects on life expectancy of renewable electricity output (beta=-0.05, p=0.05), agricultural land (beta=0.08, p=0.02), CO2 emissions (beta=-0.22, p=<0.01), government effectiveness score (beta=0.41, p<0.01), proportion of seats held by women (beta=0.011, p=0.02), unemployment rate (beta=-0.011, p<0.01), and cause of death by communicabel diseases, and maternal and prenatal nutrition conditions (beta=0.55, p<0.01).
Some of these coefficients relate the predictor to life expectancy in a linear manner, while others modify the relationship between life expectancy and the square root of the predictor or the square root of the negative of the predictor.
It is important to keep in mind that several transformations have been done to this data before running the linear model. Looking back at the correlations in the original data can therefore be informative.
eval<-read.csv("../data/wb_tidy.csv", row.names = 1)%>% #re-import tidy
select(Country, # select columns that were selected through linear modelling
Renewable.electricity.output,
Agricultural.land,
CO2.emissions,
Government.Effectiveness,
Proportion.of.seats.held.by.women.in.parliaments,
Access.to.clean.fuels.and.technologies.for.cooking,
Unemployment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
Life.expectancy.at.birth)
#print correlations between life expectancy and each predictor in original form, and compare to the coefficients
for (i in 2:length(finalmodel$coefficients)){
var<-colnames(eval)[i]
col<-select(eval, var)
print(paste("Correlation between", var, "and life expectancy : ", cor(col, eval$Life.expectancy.at.birth) ))
print(paste("Coefficient of", names(finalmodel$coefficients)[i], " : ", unname(finalmodel$coefficients)[i]))
}
Here’s a plot of the variables included in the model, the predictors are scaled whereas the outcome of life expectancy is the original value.
library(plotly)
wb_outcome<-read.csv("../data/wb_tidy.csv", row.names = 1)%>% # store outcome
select(Country, Life.expectancy.at.birth)
# dataframe with scaled predictors and original outcome
for_plot<-left_join(wb_transformed, wb_outcome, by='Country')%>%
select(Country,
Renewable.electricity.output,
Agricultural.land,
CO2.emissions,
Government.Effectiveness,
Proportion.of.seats.held.by.women.in.parliaments,
Access.to.clean.fuels.and.technologies.for.cooking,
Unemployment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
Life.expectancy.at.birth.y)
# Pretty column names for graph
colnames(for_plot) <- c('Country',
'Renewable energy consumption (% of total)',
'Agricultural land (% of land area)',
'CO2 emissions (metric tons per capita)',
'Government Effectiveness: Estimate',
'Seats held by women in national parliaments (%)',
'Access to clean fuels and technologies for cooking (%)',
'Unemployment (% of total labor force)',
'CoD by communicable diseases, maternal, and prenatal cond (%)',
'Life expectancy at birth, total (years)')
# Re-shape
for_plot<-gather(for_plot, Measure, Value, -Country, -`Life expectancy at birth, total (years)`)
for_plot$Measure<-as.factor(for_plot$Measure)
p <-ggplotly(ggplot(for_plot,
aes(x = Value,
y = `Life expectancy at birth, total (years)`,
group = Measure,
text = paste("Country:",Country,'<br>', Measure, ':', Value, '<br>Life Expectancy:', `Life expectancy at birth, total (years)`, ' years')))+
geom_point(aes(color=Measure))+
geom_smooth(aes(color=Measure), method=lm,size=.7,se=F)+
labs(title= "Normalized predictors of life expectancy", x="Normalized Measure Value", y="Life expectancy at birth, total (years)")+
theme_minimal(), tooltip = "text")%>%
layout(legend = list(x=1,y=1,font=list(size=8)))
p